Intro to R/RStudio

Brandon Greenawalt
Data Science Technologist





Center for Social Research

2017-05-17

R

  • R is fast becoming a general programming environment

    • Not just for statistics anymore!
  • R is an object-oriented programming language.

Is R Really Different?

  • Cost

    • You cannot beat free
  • Power

    • Object-oriented programming language with near-infinite flexibility
  • Open

    • Anyone can contribute to it and/or look at function code
  • Maintainers/Package Creators

    • The people who develop methods create packages for those methods

RStudio

  • RStudio, Inc. is a company dedicated to all things R

  • RStudio is also the single largest contributor to R

  • It was founded by J.J. Allaire

  • Hadley Wickham is the Chief Scientist

  • RStudio is also the name of its integrated development environment (IDE) for R

Scripting

Before RStudio, there were other options besides using the console.

- Notepad++ 
- Tinn-R
- R-Commander

In scripting with RStudio, you are getting:

- Code completion (use tab to autocomplete anything)
- Code highlighting
- Code diagnostics/warnings
- Code snippets (tab for apply and loops)
- Easily accessible help files (F1 on any function)
- Code tidying (Ctrl + Shift + A)
- More shortcuts than you can learn (Alt + Shift + K)
- Automatic pairing of closures (or...ruining your typing)

File Types

In addition to R scripts, RStudio offers optimized editing for:

  • Text
    • JS
    • CSS
    • Python
  • Markdown
  • Presentation
  • C++

RStudio

RStudio is also an excellent tool for reproducible research.

  • Project management
  • Package building and freezing
  • Document generation

Projects

All tabs opened will remain open when you revisit the project.

You can have multiple projects running at the same time

  • i.e. multiple RStudio instances

Projects help you get more organized.

Projects help you make your work more reproducible.

File/New Project

Set up a project on your own

The Basics

Everything in R is an object.

You must create an object and you can then call on the object.

  • Always be sure to name objects something other than function names!
numList = 1:5

numList
## [1] 1 2 3 4 5
  • <- is the classic assignment operator.
numList * 5
## [1]  5 10 15 20 25
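
As a small sketch of the two assignment operators mentioned above: at the top level of a script, `<-` and `=` do the same job. The names numList2 and numTimesFive are made up for this illustration.

```r
# Both operators bind a name to a value at the top level
numList <- 1:5    # classic assignment operator
numList2 = 1:5    # behaves the same way here; numList2 is an illustrative name

identical(numList, numList2)

# An operation returns a new value; assign it to keep it
numTimesFive <- numList * 5
numTimesFive
```

Whichever operator you choose, using it consistently within a script tends to make the code easier to read.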

Let's Try Section 1

Object Types

R has many different kinds of objects:

Item

  • Numeric

  • Character

  • Factor/ordered

Data

  • Data frame

  • Matrix

  • List

  • Tibble
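
A rough sketch of how each kind of object can be created and inspected. All object names here are invented for the example, and the tibble part assumes the tibble package is installed, so it is guarded.

```r
# Item-level objects
num <- c(1.5, 2, 3)                        # numeric
chr <- c("a", "b", "c")                    # character
fct <- factor(c("low", "high", "low"))     # factor
ord <- factor(c("low", "high", "low"),
              levels = c("low", "high"),
              ordered = TRUE)              # ordered factor

# Data-level objects
df  <- data.frame(id = 1:3, score = num)   # data frame
mat <- matrix(1:6, nrow = 2)               # matrix
lst <- list(values = num, tags = chr)      # list

# class() reports what kind of object you have
sapply(list(num, chr, fct, df, mat, lst), function(x) class(x)[1])

# A tibble needs the tibble package, so only try it if the package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(df)
  class(tbl)
}
```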

The Index

Because R creates objects, each object can be referenced through an index.

Like many other languages, an object’s index is generally accessed using []:

numList[1:3]
## [1] 1 2 3
numList[1:3] * 5
## [1]  5 10 15

For named objects, we can use the $:

head(mtcars$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1

A Big Index Hint

Just like matrix algebra and dimensional lumber – obj[rows, columns]

mtcars[1, ]
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
head(mtcars[, 1])
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
mtcars[1, 1]
## [1] 21
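
The same [rows, columns] pattern also accepts names and negative positions. A minimal sketch using the built-in mtcars data:

```r
# Rows and columns can be picked by name as well as by position
mtcars["Mazda RX4", "mpg"]             # one cell, by row name and column name

# Several rows and columns at once
mtcars[1:3, c("mpg", "wt")]

# Negative positions drop elements instead of keeping them
numList <- 1:5
numList[-1]                            # everything except the first element

# $ and [[ ]] both pull a single column out as a vector
identical(mtcars$mpg, mtcars[["mpg"]])
```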

Let's Try Section 2

Operators and Math Functions

Like any other language (or program, for that matter), R has the ability to use operators:

mtcars$mpg[mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146]
##  [1] 21.0 21.0 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 15.5 15.2 13.3 19.2 15.8 19.7 15.0
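
One thing worth checking in a condition like the one above is operator precedence: R evaluates & before |. The sketch below (variable names are illustrative) makes the grouping explicit and compares it with the other possible reading.

```r
# & is evaluated before |, so the condition above is read as:
#   cyl == 6  OR  (cyl == 8 AND hp >= 146)
asWritten <- mtcars$cyl == 6 | mtcars$cyl == 8 & mtcars$hp >= 146
explicit  <- mtcars$cyl == 6 | (mtcars$cyl == 8 & mtcars$hp >= 146)
identical(asWritten, explicit)

# Grouping the cylinder test first asks a different question:
#   (cyl == 6 OR cyl == 8)  AND  hp >= 146
regrouped <- (mtcars$cyl == 6 | mtcars$cyl == 8) & mtcars$hp >= 146

sum(asWritten)   # number of cars matched by the original condition
sum(regrouped)   # number of cars matched by the regrouped condition
```

Adding parentheses costs nothing and tells the reader which grouping was intended.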

And math functions:

sqrt((2 + 2)^2 * (7 / (2 - 1))) * pi
## [1] 33.24749

Let's Try Section 3

Basic Functions

Even with all of the packages that R has, base R is still extremely powerful by itself.

str(numList)
##  int [1:5] 1 2 3 4 5
summary(numList)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5
mean(numList)
## [1] 3

Basic Analyses

cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
lm(mpg ~ wt, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344
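
Because a fitted model is itself an object, its pieces can be pulled out and reused. A short sketch; the names fit, fitSummary, and r are chosen just for this example.

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                      # named vector: (Intercept) and wt
fitSummary <- summary(fit)
fitSummary$coefficients        # estimates, std. errors, t values, p values

# With a single predictor, R-squared equals the squared correlation
r <- cor(mtcars$mpg, mtcars$wt)
all.equal(fitSummary$r.squared, r^2)
```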

Basic Plotting

plot(mtcars$wt, mtcars$mpg, pch = 19)

Combining Functions

R allows you to combine functions:

plot(mtcars$wt, mtcars$mpg, pch = 19)
lines(lowess(mtcars$wt, mtcars$mpg), col = "#FF6600", lwd = 2)
abline(lm(mpg ~ wt, data = mtcars), col = "#0099ff", lwd = 2)

Let's Try Section 4

CRAN

The Comprehensive R Archive Network is the “official” package repository for R.

  • There are currently 10577 packages on CRAN.
  • For perspective, R has 200x more functions than SAS.

Finding Information

CRAN Task Views allow you to see a variety of functions associated with topics.

Task View                    Example Packages
Econometrics                 wbstats & plm
Finance                      quantmod & urca
Machine Learning             rpart & caret
Natural Language Processing  tm & koRpus
Psychometrics                lavaan & mirt
Spatial                      sp & rgdal
Time Series                  zoo & forecast

Installing packages

From CRAN

install.packages(c("devtools", "dplyr"))

From GitHub:

devtools::install_github("hadley/httr")
  • The devtools package also has the ability to install packages from other repositories (e.g., bitbucket, svn).
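
When a script is shared, it can help to install only the packages that are missing. The helper below is a hypothetical sketch, not part of any package; it relies only on base R's requireNamespace().

```r
# Hypothetical helper: report which of the requested packages are not installed
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("devtools", "dplyr")
toInstall <- missingPackages(wanted)

# Install only what is missing (commented out so nothing is downloaded by accident)
# if (length(toInstall) > 0) install.packages(toInstall)
toInstall
```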

Tidyverse

install.packages("tidyverse")
library(tidyverse)

Note: Not for CRC cluster use

Tidyverse - Core

Package  Use Case
ggplot2  data visualisation
dplyr    data manipulation
tidyr    data tidying
readr    data import
purrr    functional programming
tibble   tibbles, a modern re-imagining of data frames

Tidyverse - data manipulation

Package    Use Case
hms        times
stringr    strings
lubridate  date/times
forcats    factors

Tidyverse - data import

Package   Use Case
DBI       databases
haven     SPSS, SAS and Stata files
httr      web apis
jsonlite  JSON
readxl    .xls and .xlsx files
rvest     web scraping
xml2      XML

Tidyverse - modeling

Package  Use Case
modelr   simple modelling within a pipeline
broom    turning models into tidy data

Install the Tidyverse on your own





install.packages("tidyverse")
library(tidyverse)

Modern Approaches For Data Wrangling

We saw a glimpse of what base R has to offer in terms of data manipulation.

As powerful as the indexing approach may be, it can often be messy and slightly confusing to someone who may be interested in using your code (or the future you).

  • Because R is object-oriented, it is inherently more powerful than traditional stats programs.

An Example

### NICE R DATA ###

# numeric indexes; not conducive to readability or reproducibility
newData = mtcars[, 1:4]

# explicitly by name; fine if only a handful; not pretty
newData = mtcars[, c('mpg','cyl', 'disp', 'hp')]

### MEAN REAL DATA ###

# two step with grep (searching with regular expressions)
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', 
         grep("^Merc[0-9]+", colnames(oldData), value = TRUE))

newData = oldData[, cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No, and only the last 50 of those results, ordered by var1 (descending)?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No', ]
newData = tail(newData, 50)
newData = newData[order(newData$var1, decreasing = TRUE), ]

And this is for fairly straightforward operations.
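
To try the base R steps end to end, here is a sketch on made-up data. The data frame oldData and every value in it are invented for illustration (random numbers with an arbitrary seed); only the column names follow the slides.

```r
set.seed(1)   # arbitrary seed; the data are random placeholders
n <- 400

# Hypothetical stand-in for oldData with the columns the slides refer to
xCols <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(xCols) <- paste0("X", 1:10)
oldData <- data.frame(
  ID = seq_len(n), xCols,
  var1 = rnorm(n), var2 = rnorm(n),
  Merc1 = runif(n), Merc2 = runif(n),
  Z = sample(c("Yes", "No"), n, replace = TRUE),
  Q = sample(c("Yes", "No"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Pick columns by name and by regular expression
cols <- c("ID", paste0("X", 1:10), "var1", "var2",
          grep("^Merc[0-9]+", colnames(oldData), value = TRUE))

# Filter rows, keep the last 50 matches, then sort by var1 (descending)
keep    <- oldData$Z == "Yes" & oldData$Q == "No"
newData <- oldData[keep, cols]
newData <- tail(newData, 50)
newData <- newData[order(newData$var1, decreasing = TRUE), ]

dim(newData)
```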

dplyr

The dplyr package was created to make data manipulation easier.

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(num_range('X', 1:10), contains('var'), starts_with('Merc')) %>% 
  tail(50) %>% 
  arrange(desc(var1))

Other Handy Functions

mtcars %>% 
  filter(am == 0) %>% # Automatic transmission
  select(mpg, cyl, hp, wt) %>% 
  mutate(rawWeight = wt * 1000) %>% 
  group_by(cyl) %>% 
  summarize_all(mean) 
## # A tibble: 3 × 5
##     cyl    mpg        hp       wt rawWeight
##   <dbl>  <dbl>     <dbl>    <dbl>     <dbl>
## 1     4 22.900  84.66667 2.935000  2935.000
## 2     6 19.125 115.25000 3.388750  3388.750
## 3     8 15.050 194.16667 4.104083  4104.083
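
For comparison, roughly the same summary can be built in base R. This is a sketch of one possible equivalent, with illustrative object names (autoCars, byCyl), not what dplyr does internally.

```r
# Base R version of the pipeline, step by step
autoCars <- mtcars[mtcars$am == 0, c("mpg", "cyl", "hp", "wt")]  # filter + select
autoCars$rawWeight <- autoCars$wt * 1000                          # mutate

# group_by + summarize: mean of every other column within each cyl group
byCyl <- aggregate(. ~ cyl, data = autoCars, FUN = mean)
byCyl
```

The piped version reads top to bottom as a sequence of verbs, while the base version relies on intermediate objects and index syntax.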

Let's Try Section 5

coalesce

x = c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)

y = c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)

z = c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10 )

coalesce(x, y, z)
##  [1]  1  2  3  4  5  6  7  8  9 10
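
To see what is happening, here is a rough base R sketch of the same idea: take the first non-missing value at each position. The helper coalesceBase is hypothetical and is not how dplyr implements coalesce().

```r
# Illustrative helper (not dplyr's implementation): fill NAs in the first
# vector with values from the later vectors, position by position
coalesceBase <- function(...) {
  Reduce(function(filled, candidate) {
    ifelse(is.na(filled), candidate, filled)
  }, list(...))
}

x <- c(1, 2, NA, NA, 5, 6, NA, 8, NA, NA)
y <- c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA)
z <- c(NA, NA, NA, NA, NA, NA, 7, NA, 9, 10)

coalesceBase(x, y, z)
```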

Let's Try Section 6

A Quick Word on %>%

In the previous snippet, you hopefully noticed the %>%.

It is included in dplyr, but it originates in magrittr.

It is pronounced as pipe and is functionally equivalent to the Unix |

  • Check out magrittr for other pipes.

Why?

Old-school R:

ceiling(mean(abs(sample(-100:100, 50))))

Piping:

-100:100 %>% 
  sample(50) %>% 
  abs %>% 
  mean %>% 
  ceiling

Both are valid, but one is just a bit easier for human eyes and easier to code.
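
The core idea of a pipe can be sketched in one line of base R. The operator %then% below is a made-up toy for illustration only; magrittr's %>% does considerably more (extra arguments, placeholders, and so on).

```r
# Hypothetical toy operator: pass the left-hand value to the function on the right
`%then%` <- function(lhs, rhs) rhs(lhs)

set.seed(42)   # arbitrary seed so both versions draw the same sample
nested <- ceiling(mean(abs(sample(-100:100, 50))))

set.seed(42)
piped <- sample(-100:100, 50) %then% abs %then% mean %then% ceiling

identical(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order they happen.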

Just Scratching The Surface

We have only really seen the tip of the iceberg with regard to what R has to offer.

  • We did not even talk about all its analytical capabilities – just know that it does anything!

Do take some time to look through the CRAN Task Views.

The R-bloggers website always has new and neat stuff.

Daily and weekly trending repositories on GitHub are also enlightening.

## 
## If you think you can learn all of R, you are wrong. For the foreseeable
## future you will not even be able to keep up with the new additions.
##    -- Patrick Burns (Inferno-ish R)
##       CambR User Group Meeting, Cambridge (May 2012)

Example

library(plotly)

plot_ly(economics, x = ~date, y = ~uempmed) %>%
  add_trace(y = ~fitted(loess(uempmed ~ as.numeric(date))), x = ~date) %>%
  layout(title = "Median duration of unemployment (in weeks)", showlegend = FALSE) %>%
  dplyr::filter(uempmed == max(uempmed)) %>%
  layout(annotations = list(x = ~date, y = ~uempmed, text = "Peak", showarrow = TRUE))

Example

library(plotly)
df <- read.csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df$pop = maps::us.cities$pop[match(paste(df$city, df$state), maps::us.cities$name)]
df$hover <- with(df, paste(airport, city, '<br>',
                           "Population: ", pop, '<br>', 
                           "Arrivals: ", cnt))

# marker styling
m <- list(
  colorbar = list(title = "Incoming flights February 2011"),
  size = scales::rescale(df$pop, c(5, 20)), opacity = 0.5, border='rgba(0,0,0,0)'
)

# geo styling
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  subunitcolor = toRGB("gray85"),
  countrycolor = toRGB("gray85"),
  countrywidth = 0.5,
  subunitwidth = 0.5,
  bgcolor='rgba(0,0,0,0)'
)

plot_ly(df, lat = ~lat, lon = ~long, text = ~hover, color = ~cnt, marker=m, 
        type = 'scattergeo', locationmode = 'USA-states', mode = 'markers', colors='RdBu',
         width=1000) %>%
  layout(title = 'Most trafficked US airports<br>(Hover for airport)', geo = g,
         paper_bgcolor='rgba(0,0,0,0)',
         plot_bgcolor='rgba(0,0,0,0)',
         font=list(color=toRGB("gray85"))
         )

Last One…

library(ggplot2)

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(text = paste("Transmission:", as.factor(am))), size = 2) +
  geom_smooth(aes(colour = as.ordered(cyl), fill = as.ordered(cyl)), 
              show.legend = FALSE) + 
  facet_grid(. ~ cyl) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  #scale_colour_discrete(name = "Cylinders") +
  lazerhawk::theme_trueMinimal()

ggplotly(p)

Interactive Tables

DT::datatable(head(mtcars), filter = "top")

R Markdown

What is R Markdown?

Markdown is a markup language.

  • Allows for easier web-based documentation.
  • Not necessary to know html.
  • Lots of things will use it.
  • R Markdown is a flavor of Markdown.

Now one can intermingle R with markdown, html, css, JavaScript, \(\LaTeX\) and others resulting in a variety of products.

R Markdown

RStudio and R Markdown make it easy to construct:

  • html, pdf, MS Word documents
  • presentations (like this one)
  • dashboards
  • notebooks
  • websites
  • other publications

Example

`r lmSum = summary(lm(mpg ~ wt, data = mtcars))

if (lmSum$coefficients[2, 4] < .05) {
  paste("Weight's coefficient of", 
        round(lmSum$coefficients[2], 3), 
        "is significant", sep = " ")
} else {paste("Weight's coefficient of", 
        round(lmSum$coefficients[2], 3), 
        "is not significant", sep = " ")}`
## [1] "Weight's coefficient of -5.344 is significant"

An Example In Practice

A good man once said:

You, my dear sir, are but a mere bootless beef-witted bugbear and I bid you a good day.

The Code

paste(sample(c('artless','bawdy','beslubbering','bootless'), 1), 
      sample(c('base-court','bat-fowling','beef-witted','beetle-headed'), 1),
      sample(c('apple-john','baggage','barnacle','bladder','boar-pig'), 1))
## [1] "beslubbering bat-fowling barnacle"

Cheat Sheets

RStudio wants everything to be easy for us as R users.

They provide a series of cheat sheets as reference material.

https://www.rstudio.com/resources/cheatsheets/

Cheat Sheets

Data Visualization

Data Wrangling

R Markdown

Package Development

Shiny

Trump Tweets

http://varianceexplained.org/r/trump-tweets/

A thanks to:

  • Seth Berry @ Mendoza College of Business

  • Michael Clark @ CSCAR, U of Mich

  • Anshumaan Bajpai @ Center for Social Research

  • Center for Digital Scholarship

Questions